Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus

Internal identifier: 000261 (Main/Exploration); previous: 000260; next: 000262

Authors: Karima Meftouh [Algeria]; Salima Harrat [Algeria]; Salma Jamoussi [Tunisia]; Mourad Abbas [Algeria]; Kamel Smaili

RBID : Hal:hal-01261587

Abstract

In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). The study covers five dialects: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources: Arabic dialects are generally used in daily conversation across the Arab world but are not written. To the best of our knowledge, PADIC is currently the largest corpus available to the community working on dialects, especially Maghrebi ones. PADIC contains 6400 sentences for each of the five dialects and for MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, although it is the largest of its kind, is still very small compared with those used for translating natural languages.
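
The language model interpolation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the toolkit, the n-gram order, the smoothing methods, or the interpolation weights, so everything below is an assumption made for illustration only: unigram models, additive (add-k) smoothing, toy placeholder tokens, and a grid search for the weight `lam` on held-out text. It is not the authors' implementation.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, k=1.0):
    """Additively smoothed unigram model: P(w) = (c(w) + k) / (N + k * |V|)."""
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}


def interpolate(p_small, p_large, lam):
    """Linear interpolation: P(w) = lam * P_small(w) + (1 - lam) * P_large(w)."""
    return {w: lam * p_small[w] + (1.0 - lam) * p_large[w] for w in p_small}


def perplexity(model, tokens):
    """Perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Toy data: hypothetical placeholder tokens, not taken from PADIC.
small_corpus = "a b a c".split()               # stands in for the small in-domain text
large_corpus = "a b b c c c d d d d".split()   # stands in for the large background text
held_out = "a d c".split()                     # used only to tune the weight
vocab = sorted(set(small_corpus) | set(large_corpus) | set(held_out))

p_small = train_unigram(small_corpus, vocab)
p_large = train_unigram(large_corpus, vocab)

# Choose the interpolation weight that minimises held-out perplexity.
candidates = [i / 10 for i in range(11)]
best_lam = min(
    candidates,
    key=lambda lam: perplexity(interpolate(p_small, p_large, lam), held_out),
)
p_mix = interpolate(p_small, p_large, best_lam)
```

Because the candidate weights include 0 and 1, the tuned mixture is never worse on the held-out text than either model alone. The small model contributes in-domain counts and the large one covers words it has rarely or never seen, which is the usual motivation for this kind of interpolation.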

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus</title>
<author>
<name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21502" status="VALID">
<orgName>Laboratoire de Recherche en Informatique</orgName>
<orgName type="acronym">LRI-ANNABA</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-101665" status="INCOMING">
<orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324511" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324511" type="direct">
<org type="institution" xml:id="struct-324511" status="INCOMING">
<orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-206882" status="VALID">
<orgName>Multimedia, InfoRmation systems and Advanced Computing Laboratory</orgName>
<orgName type="acronym">MIRACL</orgName>
<desc>
<address>
<addrLine>Route de Tunis, km 10, BP 242, Sakiet Ezziet, 3021 SFAX</addrLine>
<country key="TN"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-350672" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-350672" type="direct">
<org type="institution" xml:id="struct-350672" status="INCOMING">
<orgName>FSEG-Sfax, ISIM-Sfax</orgName>
<desc>
<address>
<country key="TN"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Tunisie</country>
</affiliation>
</author>
<author>
<name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<affiliation wicri:level="1">
<hal:affiliation type="institution" xml:id="struct-267396" status="VALID">
<orgName>Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe</orgName>
<orgName type="acronym">CRSTDLA</orgName>
<desc>
<address>
<addrLine>1,Rue Djamel Eddine EL-Afghani B.P :225. Rostomia-Bouzareah Alger - 16011</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.crstdla.edu.dz/fr/</ref>
</desc>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
<affiliation>
<hal:affiliation type="laboratory" xml:id="struct-446632" status="INCOMING">
<orgName>LORIA - UMR 7503,Campus Scientifique - BP 239, 54506 Vandoeuvre-les-Nancy Cedex, France</orgName>
<listRelation>
<relation active="#struct-446629" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-446629" type="direct">
<org type="institution" xml:id="struct-446629" status="INCOMING">
<orgName>Laboratoire LORIA</orgName>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01261587</idno>
<idno type="halId">hal-01261587</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01261587</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01261587</idno>
<date when="2015-10-30">2015-10-30</date>
<idno type="wicri:Area/Hal/Corpus">003030</idno>
<idno type="wicri:Area/Hal/Curation">003030</idno>
<idno type="wicri:Area/Hal/Checkpoint">000235</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">000235</idno>
<idno type="wicri:Area/Main/Merge">000261</idno>
<idno type="wicri:Area/Main/Curation">000261</idno>
<idno type="wicri:Area/Main/Exploration">000261</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus</title>
<author>
<name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21502" status="VALID">
<orgName>Laboratoire de Recherche en Informatique</orgName>
<orgName type="acronym">LRI-ANNABA</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-101665" status="INCOMING">
<orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324511" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324511" type="direct">
<org type="institution" xml:id="struct-324511" status="INCOMING">
<orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-206882" status="VALID">
<orgName>Multimedia, InfoRmation systems and Advanced Computing Laboratory</orgName>
<orgName type="acronym">MIRACL</orgName>
<desc>
<address>
<addrLine>Route de Tunis, km 10, BP 242, Sakiet Ezziet, 3021 SFAX</addrLine>
<country key="TN"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-350672" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-350672" type="direct">
<org type="institution" xml:id="struct-350672" status="INCOMING">
<orgName>FSEG-Sfax, ISIM-Sfax</orgName>
<desc>
<address>
<country key="TN"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Tunisie</country>
</affiliation>
</author>
<author>
<name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<affiliation wicri:level="1">
<hal:affiliation type="institution" xml:id="struct-267396" status="VALID">
<orgName>Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe</orgName>
<orgName type="acronym">CRSTDLA</orgName>
<desc>
<address>
<addrLine>1,Rue Djamel Eddine EL-Afghani B.P :225. Rostomia-Bouzareah Alger - 16011</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.crstdla.edu.dz/fr/</ref>
</desc>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
<affiliation>
<hal:affiliation type="laboratory" xml:id="struct-446632" status="INCOMING">
<orgName>LORIA - UMR 7503,Campus Scientifique - BP 239, 54506 Vandoeuvre-les-Nancy Cedex, France</orgName>
<listRelation>
<relation active="#struct-446629" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-446629" type="direct">
<org type="institution" xml:id="struct-446629" status="INCOMING">
<orgName>Laboratoire LORIA</orgName>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). The study covers five dialects: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources: Arabic dialects are generally used in daily conversation across the Arab world but are not written. To the best of our knowledge, PADIC is currently the largest corpus available to the community working on dialects, especially Maghrebi ones. PADIC contains 6400 sentences for each of the five dialects and for MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, although it is the largest of its kind, is still very small compared with those used for translating natural languages.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Algérie</li>
<li>Tunisie</li>
</country>
</list>
<tree>
<noCountry>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
</noCountry>
<country name="Algérie">
<noRegion>
<name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
</noRegion>
<name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
</country>
<country name="Tunisie">
<noRegion>
<name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000261 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000261 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01261587
   |texte=   Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022